Metabarcoding exploratory data analysis
In this notebook we will analyze PacMAN pipeline results for a subset of COI samples from Rey, A., Basurko, O. C., & Rodriguez-Ezpeleta, N. (2020). Considerations for metabarcoding-based port biological baseline surveys aimed at marine nonindigenous species monitoring and risk assessments. Ecology and Evolution, 10(5), 2452–2465. https://doi.org/10.1002/ece3.6071
Running this notebook
To run this notebook in its entirety or to have access to the data
files used in this notebook, clone or download and extract the iobis/pacman-pipeline-training
repository. The notebook can be found in the
datasets/rey/analysis directory, open this directory in
RStudio after downloading (File >
Open Project or navigate to Desktop in the Files panel and
select More >
Set As Working Directory).
If you have trouble locating the Desktop folder in RStudio, set your
Desktop as the default working directory using Tools >
Global Options >
Default working directory.
Importing data
The PacMAN pipeline exports results as a phyloseq object
which can be imported into R for analysis. First install the phyloseq
package from Bioconductor:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phyloseq")This is how phyloseq objects are structured:
Read the phyloseq object and verify that we have:
- a OTU table
- a sample table
- a taxonomy table
A phylogenetic tree has not been calculated so that slot is empty.
physeq <- readRDS("../results_noblast/phyloseq_object.rds")
physeq## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5327 taxa and 22 samples ]
## sample_data() Sample Data: [ 22 samples by 20 sample variables ]
## tax_table() Taxonomy Table: [ 5327 taxa by 14 taxonomic ranks ]
While we can access these tables individually (for example using
physeq@sam_data or sample_data(physeq)), the
phyloseq package also has a function psmelt() to convert
the phyloseq object into a single large data frame:
library(dplyr)
library(phyloseq)
df <- psmelt(physeq) %>%
# convert empty strings to NA across all character columns
mutate_if(is.character, ~na_if(., "")) %>%
# convert to tibble for prettier printing
as_tibble()
df## # A tibble: 117,194 × 37
## OTU Sample Abund…¹ eventID mater…² event…³ verba…⁴ local…⁵ verba…⁶ event…⁷
## <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 asv.2 SAMN1… 25147 SAMN10… SAMN10… settle… 43.28 … Spain:… 7m 2017-1…
## 2 asv.7 SAMN1… 16275 SAMN10… SAMN10… settle… 43.28 … Spain:… 7m 2017-0…
## 3 asv.8 SAMN1… 15862 SAMN10… SAMN10… settle… 43.37 … Spain:… 7m 2017-0…
## 4 asv.6 SAMN1… 15401 SAMN10… SAMN10… filter… 43.34 … Spain:… surface 2017-0…
## 5 asv.5 SAMN1… 12944 SAMN10… SAMN10… settle… 43.37 … Spain:… 7m 2017-1…
## 6 asv.11 SAMN1… 12292 SAMN10… SAMN10… settle… 43.34 … Spain:… 7m 2017-0…
## 7 asv.14 SAMN1… 11142 SAMN10… SAMN10… filter… 43.37 … Spain:… surface 2017-0…
## 8 asv.10 SAMN1… 10447 SAMN10… SAMN10… filter… 43.37 … Spain:… surface 2017-0…
## 9 asv.3 SAMN1… 10377 SAMN10… SAMN10… settle… 43.34 … Spain:… 7m 2017-1…
## 10 asv.18 SAMN1… 9850 SAMN10… SAMN10… settle… 43.37 … Spain:… 7m 2017-0…
## # … with 117,184 more rows, 27 more variables: occurrenceStatus <chr>,
## # target_gene <chr>, subfragment <chr>, pcr_primer_forward <chr>,
## # pcr_primer_reverse <chr>, pcr_primer_name_forw <chr>,
## # pcr_primer_name_reverse <chr>, pcr_primer_reference <chr>,
## # lib_layout <chr>, seq_meth <chr>, sop <chr>, votu_db <chr>, My_name <chr>,
## # kingdom <chr>, phylum <chr>, class <chr>, order <chr>, family <chr>,
## # genus <chr>, species <chr>, lastvalue <chr>, otu_seq_comp_appr <chr>, …
Exploratory data analysis
Abundance by phylum
Let’s first determine which are the most abundant phyla across all
samples. We will use this information to bin the less abundant phyla
into a category Other for simplifying some of our
graphics.
stats_phyla <- df %>%
group_by(phylum) %>%
summarize(abundance = sum(Abundance)) %>%
arrange(desc(abundance))
top_phyla <- stats_phyla$phylum %>% na.omit() %>% head(8)
top_phyla## [1] "Arthropoda" "Cnidaria" "Bacillariophyta" "Annelida"
## [5] "Bryozoa" "Nemertea" "Chlorophyta" "Mollusca"
Now create a color scale for these phyla using the
RColorBrewer package:
phylum_colors <- RColorBrewer::brewer.pal(9, "Spectral")
names(phylum_colors) <- c(top_phyla, "Other")
phylum_colors## Arthropoda Cnidaria Bacillariophyta Annelida Bryozoa
## "#D53E4F" "#F46D43" "#FDAE61" "#FEE08B" "#FFFFBF"
## Nemertea Chlorophyta Mollusca Other
## "#E6F598" "#ABDDA4" "#66C2A5" "#3288BD"
And add a new column with binned phyla:
df <- df %>%
mutate(phylum_binned = ifelse(is.na(phylum) | phylum %in% top_phyla, phylum, "Other")) %>%
mutate(phylum_binned = factor(phylum_binned, levels = c(top_phyla, "Other")))Let’s also add the binned phyla to the phyloseq object
for later use:
# remotes::install_github("david-barnett/microViz")
physeq <- physeq %>%
microViz::tax_mutate(phylum = ifelse(phylum == "", NA, phylum)) %>%
microViz::tax_mutate(phylum_binned = ifelse(is.na(phylum) | phylum %in% top_phyla, phylum, "Other"))Abundance by sample type and phylum
First list the abundance by phylum and sample type:
library(tidyr)
stats_type_phylum <- df %>%
group_by(eventRemarks, phylum) %>%
summarize(abundance = sum(Abundance))
spread(stats_type_phylum, eventRemarks, abundance) %>%
mutate(total = `settlement plates` + `filtered water`) %>%
arrange(desc(total)) %>%
knitr::kable()| phylum | filtered water | settlement plates | total |
|---|---|---|---|
| NA | 353725 | 126557 | 480282 |
| Arthropoda | 13857 | 99041 | 112898 |
| Cnidaria | 3603 | 60871 | 64474 |
| Bacillariophyta | 45861 | 979 | 46840 |
| Annelida | 8706 | 15085 | 23791 |
| Bryozoa | 454 | 22631 | 23085 |
| Nemertea | 7 | 20234 | 20241 |
| Chlorophyta | 17922 | 10 | 17932 |
| Mollusca | 3801 | 6219 | 10020 |
| Haptista | 4682 | 13 | 4695 |
| Rhodophyta | 3082 | 197 | 3279 |
| Rotifera | 1926 | 56 | 1982 |
| Oomycota | 1303 | 414 | 1717 |
| Basidiomycota | 1481 | 18 | 1499 |
| Chordata | 1213 | 211 | 1424 |
| Platyhelminthes | 11 | 1382 | 1393 |
| Prasinodermophyta | 1071 | 0 | 1071 |
| Echinodermata | 767 | 53 | 820 |
| Ascomycota | 733 | 3 | 736 |
| Streptophyta | 444 | 0 | 444 |
| Porifera | 281 | 115 | 396 |
| Discosea | 173 | 63 | 236 |
| Nematoda | 21 | 95 | 116 |
| Placozoa | 96 | 0 | 96 |
| Phoronida | 57 | 18 | 75 |
| Evosea | 62 | 7 | 69 |
| Gastrotricha | 6 | 32 | 38 |
| Picozoa | 28 | 0 | 28 |
| Chytridiomycota | 18 | 0 | 18 |
| Sipuncula | 12 | 0 | 12 |
| Imbricatea | 0 | 7 | 7 |
| Chaetognatha | 5 | 0 | 5 |
| Entoprocta | 0 | 4 | 4 |
| Tubulinea | 3 | 0 | 3 |
| Onychophora | 2 | 0 | 2 |
| Zoopagomycota | 2 | 0 | 2 |
Now create a graph using the binned phyla:
library(ggplot2)
stats_type_phylum <- df %>%
# calculate total abundance by sample type
group_by(eventRemarks) %>%
mutate(abundance = Abundance / sum(Abundance)) %>%
group_by(eventRemarks, phylum_binned) %>%
summarize(abundance = sum(abundance))
ggplot() +
geom_bar(data = stats_type_phylum, aes(y = eventRemarks, fill = phylum_binned, x = abundance), stat = "identity") +
scale_fill_manual(values = phylum_colors, na.value = "#eeeeee") +
theme_minimal()Abundance (reads) by sample and phylum (optional)
stats_sample_phylum <- df %>%
# calculate total abundance by sample type
group_by(Sample) %>%
mutate(relative_abundance = Abundance / sum(Abundance)) %>%
group_by(Sample, eventRemarks, phylum_binned) %>%
summarize(relative_abundance = sum(relative_abundance), abundance = sum(Abundance))Absolute reads abundance:
ggplot() +
geom_bar(data = stats_sample_phylum, aes(y = Sample, fill = phylum_binned, x = abundance), stat = "identity") +
scale_fill_manual(values = phylum_colors, na.value = "#eeeeee") +
theme_minimal() +
theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())Relative reads abundance:
ggplot() +
geom_bar(data = stats_sample_phylum, aes(y = Sample, fill = phylum_binned, x = relative_abundance), stat = "identity") +
scale_fill_manual(values = phylum_colors, na.value = "#eeeeee") +
theme_minimal() +
theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())Most abundant species
Let’s list the species with the highest relative abundance across all samples:
top_species <- df %>%
filter(!is.na(species)) %>%
group_by(species) %>%
summarize(abundance = sum(Abundance)) %>%
arrange(desc(abundance)) %>%
head(20)
top_species %>% knitr::kable()| species | abundance |
|---|---|
| Balanus trigonus | 37896 |
| Bougainvillia muscus | 33490 |
| Minutocellus polymorphus | 27149 |
| Obelia dichotoma | 19006 |
| Emplectonema gracile | 16805 |
| Micromonas pusilla | 15598 |
| Dictyocha speculum | 13538 |
| Bugula neritina | 10243 |
| Harpyia umbrosa | 9099 |
| Cutleria multifida | 5503 |
| Ostrea stentina | 5267 |
| Polydora neocaeca | 5066 |
| Paracalanus parvus | 4374 |
| Sabellaria spinulosa | 3627 |
| Amphibalanus eburneus | 3567 |
| Palmaria decipiens | 2398 |
| Amphinema dinema | 2084 |
| Chloroparvula pacifica | 1528 |
| Pseudochattonella farcimen | 1509 |
| Campanularia hincksii | 1478 |
For the most abundant species, plot the abundance by sample:
stats <- df %>%
filter(species %in% top_species$species) %>%
group_by(species, eventRemarks, Sample) %>%
summarize(abundance = sum(Abundance))
ggplot() +
geom_jitter(data = stats, aes(y = species, x = abundance, color = eventRemarks), pch = 21, height = 0.1, width = 0) +
scale_color_brewer(palette = "Set1") +
theme_minimal()Invasive species
In this section we will check our results against the World Register of Introduced Marine Species (WRiMS). The repository contains a CSV file containing the LSIDs for all species in WRiMS. Read this CSV and use the LSIDs to only retain species from WRiMS.
wrims_lsids <- read.csv("wrims_lsids.csv")
invasives <- df %>%
filter(lsid %in% wrims_lsids$lsid)
invasives %>%
group_by(phylum, species) %>%
summarize(abundance = sum(Abundance)) %>%
arrange(desc(abundance)) %>%
rmarkdown::paged_table()Alpha diversity
Use vegan::plot_richness() to visualize the number of
ASVs per sample:
plot_richness(physeq, measures = c("Observed"), color = "eventRemarks")The vegan package can also calculate a number of biodiversity indices. But keep in mind that some of these will use the reads abundance which does not necessarily correlate with actual abundance in terms of individuals, such as the Chao1 estimator which uses the numbers of singletons and doubletons to obtain a lower bound for the expected asymptotic species richness, or Shannon entropy which uses relative abundance.
estimate_richness(physeq)## Observed Chao1 se.chao1 ACE se.ACE Shannon
## SAMN10847219 731 731.0612 0.262882645 731.6197 13.507301 5.231046
## SAMN10847223 654 654.0096 0.100160727 654.5405 11.803509 4.204983
## SAMN10847226 529 529.1538 0.422097183 530.5108 11.452192 4.085992
## SAMN10847228 363 363.2500 0.580982709 363.9410 9.507433 3.291294
## SAMN10847229 407 407.0256 0.169157714 407.5179 10.070721 3.469086
## SAMN10847231 719 719.0000 0.010862004 719.2273 13.321963 4.349963
## SAMN10847249 284 284.0000 0.013864415 284.2858 8.422944 3.337518
## SAMN10847251 483 483.0182 0.140266223 483.5069 10.838516 4.407202
## SAMN10847253 372 372.0588 0.257092009 372.8648 9.601941 2.782583
## SAMN10847255 407 407.0000 0.014268153 407.2320 10.079683 3.696802
## SAMN10847264 258 258.0000 0.000000000 258.0000 6.489407 3.189349
## SAMN10847265 337 337.0000 0.008052542 337.2954 8.390622 4.633680
## SAMN10847266 392 392.0000 0.000000000 392.0000 8.630747 3.276873
## SAMN10847267 211 211.0000 0.020783907 211.2352 7.158267 2.975029
## SAMN10847272 311 311.0000 0.013136724 311.2671 8.813702 3.695959
## SAMN10847276 687 687.0000 0.005613887 687.2582 12.784487 4.464019
## SAMN10847333 298 298.0000 0.000000000 298.0000 8.631144 3.434833
## SAMN10847336 193 193.0000 0.000000000 193.0000 6.877914 2.778169
## SAMN10847345 399 399.0000 0.000000000 399.0000 9.948993 3.479036
## SAMN10847348 289 289.0000 0.000000000 289.0000 8.320983 3.581685
## SAMN10847357 130 130.0000 0.000000000 130.0000 5.395155 2.745419
## SAMN10847359 216 216.0000 0.000000000 216.0000 7.328281 2.052828
## Simpson InvSimpson Fisher
## SAMN10847219 0.9853373 68.200038 137.79902
## SAMN10847223 0.9587094 24.218612 118.09885
## SAMN10847226 0.9419372 17.222718 90.67119
## SAMN10847228 0.8741244 7.944350 57.16947
## SAMN10847229 0.9131446 11.513387 60.37124
## SAMN10847231 0.9441269 17.897686 115.89380
## SAMN10847249 0.9072691 10.783896 44.15926
## SAMN10847251 0.9713287 34.878095 84.47387
## SAMN10847253 0.6952500 3.281378 60.55650
## SAMN10847255 0.9371758 15.917441 62.94423
## SAMN10847264 0.8099860 5.262770 55.31375
## SAMN10847265 0.9771990 43.857680 75.87903
## SAMN10847266 0.8726964 7.855237 74.01503
## SAMN10847267 0.8690806 7.638287 32.16280
## SAMN10847272 0.9359133 15.603862 50.10914
## SAMN10847276 0.9645337 28.195777 117.72100
## SAMN10847333 0.9222313 12.858640 39.56879
## SAMN10847336 0.8182685 5.502624 27.12397
## SAMN10847345 0.9346967 15.313164 54.47372
## SAMN10847348 0.9344302 15.250929 39.89149
## SAMN10847357 0.8898749 9.080580 15.58998
## SAMN10847359 0.6503482 2.859988 29.61067
ASV accumulation curves
Rarefaction is an interpolation of species richness to smaller samples sizes to allow comparisons between samples.
vegan::rarecurve(t(otu_table(physeq)), step = 1000)Do it yourself (optional)
Alternatively, run vegan::rarefy() for every sample and
construct the graph yourself:
community <- t(otu_table(physeq))
curves <- purrr::map(physeq@sam_data$eventID, function(sample) {
community_row <- community[sample,]
sample_sizes <- seq(1000, sum(community_row), by = 1000)
rarefied <- vegan::rarefy(community_row, sample_sizes)
curve <- data.frame(
eventID = sample,
sample_size = sample_sizes,
asvs = as.numeric(rarefied)
) %>%
left_join(physeq@sam_data, by = "eventID")
}) %>%
bind_rows()
samples <- curves %>%
group_by(eventID) %>%
summarize(sample_size = max(sample_size), asvs = max(asvs))
ggplot() +
geom_line(data = curves %>% group_by(eventID), aes(x = sample_size, y = asvs, color = eventRemarks, group = eventID)) +
geom_label(data = samples, aes(sample_size, asvs, label = eventID), size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic()Beta diversity
Multidimensional scaling with phyloseq
Phyloseq’s ordinate() function allows us to do
MultiDimensional Scaling (MDS) or Principal Coordinate Analysis (PCoA)
on our dataset. The default for this function is to use Bray-Curtis
Dissimilarity to calculate a distance matrix.
ord <- ordinate(physeq = physeq, method = "PCoA", distance = "bray")
plot_ordination(physeq = physeq, ordination = ord, type = "samples", color = "eventRemarks", shape = "locality") +
geom_point(aes(color = eventRemarks), alpha = 1, size = 4) +
geom_text(aes(label = materialSampleID), alpha = 0.7, size = 3, vjust = 2) +
stat_ellipse(aes(group = eventRemarks)) +
scale_color_brewer(palette = "Set1") +
scale_shape(solid = TRUE) +
theme_classic()Adding taxa (optional)
We can also add taxa to the ordination plot, but styling this is a bit awkward as samples and taxa are treated as a single dataset.
plot_ordination(physeq = subset_taxa(physeq, !is.na(phylum)), ordination = ord, type = "biplot", color = "phylum_binned") +
scale_color_manual(values = phylum_colors, na.value = "#000000") +
geom_text(aes(label = materialSampleID), alpha = 0.7, size = 3, vjust = 2) +
theme_classic()Non-metric MDS (NMDS) with phyloseq (optional)
Another ordination method is non-metric MDS (NMDS). Let’s try NMDS with our dataset aggregated at the species level, and plot the most abundant species together with the sites.
First aggregates the phyloseq object to the species
level:
physeq_species <- tax_glom(physeq, taxrank = "species", NArm = TRUE, bad_empty = c(NA, ""))Then perform the ordination with method NMDS:
ord_species <- ordinate(physeq_species, method = "NMDS", distance = "bray")Extract the NMDS values for the sites and species, and join with the
relevant tables from the phyloseq object to get the
necessary metadata for plotting (such as species name, locality, and
sample type):
sites <- ord_species$points %>%
as.data.frame() %>%
tibble::rownames_to_column("site") %>%
inner_join(physeq_species@sam_data %>% as.data.frame() %>% tibble::rownames_to_column("site"), by = "site")
species <- ord_species$species %>%
as.data.frame() %>%
tibble::rownames_to_column("asv") %>%
left_join(physeq_species@tax_table %>% as.data.frame() %>% tibble::rownames_to_column("asv"), by = "asv") %>%
filter(species %in% top_species$species)ggplot() +
geom_segment(data = species, aes(x = 0, y = 0, xend = MDS1, yend = MDS2), arrow = arrow(length = unit(2, "mm")), color = "#aaaaaa") +
geom_point(data = sites, aes(MDS1, MDS2, color = eventRemarks, shape = locality), size = 3, stroke = 1.5) +
geom_text(data = species, aes(MDS1, MDS2, label = species), size = 3, vjust = "outward") +
scale_color_brewer(palette = "Set1") +
scale_shape(solid = FALSE) +
theme_classic()